
feat(flashblocks): shift atomic gossip op, wspub on seq after p2p gossip#210

Merged
louisliu2048 merged 12 commits into main from niven/flashblocks-no-reorg on Mar 27, 2026
Conversation

@sieniven
Contributor

@sieniven sieniven commented Mar 20, 2026

Summary

This PR ensures zero re-orgs on subscribed flashblocks RPC nodes by making p2p gossip to follower sequencers a blocking operation that must succeed before the flashblocks are gossiped to RPC nodes. This makes flashblocks replay atomic across sequencer switches or leader-sequencer failures: consistent gossiped flashblocks are replayed across RPC nodes, guaranteeing that RPC nodes are protected from flashblocks re-orgs. Note that the websocket publishing logic moves into the new broadcast submodule of the xlayer-builder crate.

In addition, on p2p node:

  1. On broadcast failures, the logic now immediately re-establishes a stream if the TCP connection is still alive
  2. The retry interval is reduced to 1s

Reorg considerations

If a p2p broadcast fails at the transport layer, the blocking broadcast-to-peers operation provides guarantees only at the application level, so a re-org remains possible in that case.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sieniven
Contributor Author

@claude review this PR

@github-actions
Contributor

github-actions bot commented Mar 22, 2026

Claude finished @sieniven's task in 3m 31s


PR Review: feat(flashblocks): shift atomic gossip op, wspub on seq after p2p gossip

  • Gather context and understand the PR
  • Read changed files
  • Analyze logic correctness and atomicity guarantees
  • Check for bugs, edge cases, and error handling
  • Provide review feedback

Summary

Good architectural direction — removing ws_pub from the builder and consolidating publish responsibility into the p2p Node simplifies the builder and keeps WS publishing coordinated with p2p gossip. The Message trait → concrete Message enum refactor also reduces unnecessary generics. However, there are several issues worth addressing:


🔴 Critical: WS publishes before p2p gossip in Node::run() — opposite of the stated guarantee

crates/builder/src/p2p/mod.rs, lines ~157–162:

if let Message::OpFlashblockPayload(ref fb_payload) = message {
    let flashblock_byte_size = ws_pub.publish(fb_payload)?;   // ← WS first
    metrics.flashblock_byte_size_histogram.record(flashblock_byte_size as f64);
}
if let Err(e) = outgoing_streams_handler.broadcast_message(message).await {  // ← p2p second
    warn!(target: "flashblocks-p2p", "...");
}

The PR description states: "p2p gossip to follower sequencers is a blocking successful operation first before gossiping the flashblocks to RPC nodes."

But the code publishes to WebSocket subscribers before broadcast_message() to peers. This means the leader's own WS subscribers (RPC nodes) receive the flashblock before follower sequencers do, which is the exact re-org risk the PR claims to fix. The broadcast_message() call should come first, and ws_pub.publish() only after it returns (or at least returns without error).



🟡 Medium: ws_pub.publish()? kills the p2p node if broadcast channel has no receivers

crates/builder/src/p2p/mod.rs:

let flashblock_byte_size = ws_pub.publish(fb_payload)?;

WebSocketPublisher::publish() calls self.pipe.send(utf8_bytes) (a broadcast::Sender), which returns Err when there are no active receivers. The initial receiver created in WebSocketPublisher::new() is held by the listener_loop task, so in normal operation this is unlikely to fire. However, if the listener_loop task ever panics or exits unexpectedly, subsequent calls to ws_pub.publish() will kill the p2p node entirely via ?.

The previous code in builder.rs handled this with map_err(PayloadBuilderError::other)?, which only terminated the current build job (restartable), not the p2p node service. The handler's follower path (in handler.rs) correctly uses if let Err(e) = ws_pub.publish(...) { warn!(...) }. The Node path should follow the same pattern.
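The failure mode is easy to reproduce in miniature. A minimal sketch, using std::sync::mpsc as a stand-in for tokio's broadcast channel (`publish` and the channel setup here are illustrative, not the crate's API):

```rust
use std::sync::mpsc;

// Hypothetical stand-in for WebSocketPublisher::publish. A std mpsc
// sender whose receiver has been dropped fails on send, analogous to
// tokio's broadcast::Sender::send returning Err with zero receivers.
fn publish(tx: &mpsc::Sender<String>, payload: &str) -> Result<usize, String> {
    tx.send(payload.to_string()).map_err(|e| e.to_string())?;
    Ok(payload.len())
}

fn main() {
    let (tx, rx) = mpsc::channel::<String>();
    drop(rx); // simulate the listener_loop task exiting

    // Propagating the error with `?` would kill the whole service;
    // logging keeps the node alive, as the follower path already does.
    match publish(&tx, "flashblock") {
        Ok(n) => println!("published {n} bytes"),
        Err(e) => eprintln!("warn: ws publish failed, continuing: {e}"),
    }
}
```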



🟡 Medium: flashblock_byte_size_histogram metric silently drops when p2p is disabled

crates/builder/src/flashblocks/builder.rs (removed) vs crates/builder/src/p2p/mod.rs (added):

Previously, ctx.metrics.flashblock_byte_size_histogram.record(...) was called directly in the builder — always, regardless of p2p state. Now it's only recorded inside Node::run(), which only runs when p2p_enabled = true. Deployments with p2p_enabled = false will silently lose this metric after this change.


🟢 Good: behavior fix for no_tx_pool path

crates/builder/src/flashblocks/builder.rs:

Previously, built_fb_payload_tx.try_send(fb_payload) for the initial flashblock was guarded only by !rebuild_external_payload, meaning syncing FCUs (where no_tx_pool=true) would still trigger p2p gossip. The new code correctly gates it with !no_tx_pool && !rebuild_external_payload, consistent with the comment "not emitting flashblock if no_tx_pool in FCU, it's just syncing". This is a correct fix.


🟢 Good: Message trait → enum refactor

Removing the Message generic trait in favor of a concrete Message enum in types.rs is the right call for an internal protocol — reduces boilerplate, removes the generic parameter from Node<M>, and makes deserialization straightforward (serde_json::from_str::<Message>). The types.rs module is well-organized.


Minor: Default removed for NodeBuilder

Since NodeBuilder::new() now requires ws_pub and metrics, removing Default is correct. Worth verifying no external call sites use NodeBuilder::default() (doesn't appear to be any in the diff, but worth a quick grep if not already done).


Test coverage note

The test in p2p/mod.rs now uses OpFlashblockPayload::default() as the round-trip message and binds WebSocketPublisher to 127.0.0.1:0. Consider adding a test case that verifies the WS publish is called (or not) at the right point relative to p2p gossip, once the ordering fix is applied.


TL;DR: The main concern is the WS-before-p2p ordering in Node::run(), which inverts the core atomicity guarantee the PR is trying to establish. Suggest swapping ws_pub.publish() to run after broadcast_message(), and converting the ? to a warn! to avoid killing the p2p node on publish failure.
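The suggested swap can be sketched as follows — a minimal model with hypothetical `broadcast_to_peers`/`relay` helpers and a slice of bools standing in for peer streams, not the crate's real types:

```rust
// Return indices of peers whose stream write "failed" (illustrative).
fn broadcast_to_peers(peers: &[bool], _payload: &str) -> Vec<usize> {
    peers
        .iter()
        .enumerate()
        .filter(|(_, ok)| !**ok)
        .map(|(i, _)| i)
        .collect()
}

fn relay(peers: &[bool], ws_subscribers: &mut Vec<String>, payload: &str) {
    let failed = broadcast_to_peers(peers, payload); // p2p gossip first
    if !failed.is_empty() {
        // publishing now could reorg RPC nodes on a sequencer switch
        eprintln!("warn: {} peer(s) missed the flashblock", failed.len());
        return;
    }
    ws_subscribers.push(payload.to_string()); // WS publish second
}

fn main() {
    let mut subs = Vec::new();
    relay(&[true, true], &mut subs, "fb1"); // all peers ok → published
    relay(&[true, false], &mut subs, "fb2"); // one peer failed → suppressed
    assert_eq!(subs, vec!["fb1".to_string()]);
}
```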

sieniven and others added 2 commits March 23, 2026 15:22
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… flag

Rename p2p/ module to broadcast/ to better reflect its dual role (p2p
gossip + WS publishing). Move wspub.rs into broadcast/, remove the
p2p_enabled config flag so the broadcast node always runs, fix WS
publish ordering to run after p2p gossip, and improve error handling
so ws_pub failures warn instead of crashing the node. Update stale
comments referencing the removed p2p_enabled flag.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@sieniven
Contributor Author

@claude please help to review again, check that all your previous mentioned points have been resolved

@github-actions
Contributor

github-actions bot commented Mar 23, 2026

Claude finished @sieniven's task in 2m 13s


PR Re-Review: feat(flashblocks): shift atomic gossip op, wspub on seq after p2p gossip

  • Gather context and read changed files
  • Check: WS publish ordering vs p2p gossip (🔴 Critical)
  • Check: ws_pub.publish()? error propagation (🟡 Medium)
  • Check: flashblock_byte_size_histogram metric with p2p disabled (🟡 Medium)
  • Check: Minor notes (NodeBuilder Default, test coverage)
  • Provide updated review feedback

All three previously raised issues have been resolved. Details below.

✅ Fixed: WS publish ordering (was 🔴 Critical)

crates/builder/src/broadcast/mod.rs, lines 169–181:

if let Err(e) = outgoing_streams_handler.broadcast_message(message.clone()).await {
    warn!(...)  // p2p first
}
if let Message::OpFlashblockPayload(ref fb_payload) = message {
    match ws_pub.publish(fb_payload) { ... }  // WS second
}

broadcast_message (p2p gossip) now runs before ws_pub.publish (WS). The ordering now matches the stated atomicity guarantee. ✅

✅ Fixed: ws_pub.publish()? error propagation (was 🟡 Medium)

The ? is replaced with a match that logs a warn! on failure — consistent with the follower path in handler.rs. The p2p node will no longer terminate if the WS broadcast channel has no receivers. ✅

✅ Fixed: Metric coverage when p2p disabled (was 🟡 Medium)

The p2p_enabled flag has been removed from args/op.rs. The broadcast node is now always spawned unconditionally in service.rs, so flashblock_byte_size_histogram is always recorded via Node::run(). ✅

✅ Resolved: NodeBuilder::default() call sites

No usages of NodeBuilder::default() exist in the codebase. service.rs correctly uses NodeBuilder::new(ws_pub, metrics). ✅

Data flow looks correct end-to-end

  • Leader path: builder → built_fb_payload_tx → handler (built_fb_payload_rx) → p2p_tx → Node::run() (outgoing_message_rx) → p2p gossip first → WS publish second
  • Follower path: Node::run() incoming stream → handler (p2p_rx) → cache update → WS publish

No double-publishing between the two paths (outgoing vs. incoming are separate channels). ✅

Open (non-blocking): Test coverage

The test in broadcast/mod.rs validates node connectivity and message round-tripping. A dedicated test asserting WS publish is called after p2p gossip would make the ordering guarantee explicit and regression-proof, but this is not a blocker.


Summary: All three critical/medium issues from the previous review are fully addressed. The PR achieves its stated goal — p2p gossip to follower sequencers runs as a blocking operation before WS publish to RPC subscribers. LGTM.

sieniven and others added 5 commits March 25, 2026 12:47
…ntics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When `broadcast_message` fails to write to a peer's stream, the stream
was silently dropped but the peer entry remained in `StreamsHandler`,
causing every subsequent broadcast to emit a `no stream for protocol`
warning indefinitely — confirmed in production logs.

Two-part fix:
- `broadcast_message` now carries the `PeerId` through the error path,
  evicts the peer from the map on failure, and returns the list of
  failed peers to the caller instead of swallowing the error.
- `Node::run` immediately attempts `open_stream` for each failed peer
  after every broadcast, recovering the yamux stream on the existing
  TCP connection without waiting for the 60 s retry tick.

Also adds two unit tests in `mod.rs` that drive real libp2p swarms:
`broadcast_evicts_peer_on_stream_failure` and
`broadcast_returns_empty_failed_peers_on_success`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
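The two-part fix described in the commit above can be modeled in isolation. A minimal sketch, with a `HashMap<PeerId, bool>` standing in for the real stream map (names and types here are illustrative, not the crate's API):

```rust
use std::collections::HashMap;

type PeerId = u64;

// Hypothetical sketch of the eviction logic: a failed stream write
// evicts the peer from the map, and the failed ids are returned so the
// caller can immediately attempt open_stream on the live connection.
// `bool` stands in for a writable yamux stream.
fn broadcast_message(streams: &mut HashMap<PeerId, bool>, _msg: &str) -> Vec<PeerId> {
    let mut failed: Vec<PeerId> = streams
        .iter()
        .filter(|(_, writable)| !**writable)
        .map(|(&id, _)| id)
        .collect();
    failed.sort_unstable();
    for id in &failed {
        // evict: stop warning "no stream for protocol" on every broadcast
        streams.remove(id);
    }
    failed
}

fn main() {
    let mut streams = HashMap::from([(1u64, true), (2u64, false)]);
    let failed = broadcast_message(&mut streams, "flashblock");
    assert_eq!(failed, vec![2]);
    assert!(!streams.contains_key(&2)); // evicted
    assert!(streams.contains_key(&1)); // healthy peer untouched
}
```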
Speed up P2P peer reconnection by lowering the default retry interval,
improving flashblocks broadcast recovery after transient disconnects.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace test-only StreamProtocol constants with the actual
FLASHBLOCKS_STREAM_PROTOCOL to ensure tests exercise the same protocol
negotiation as production code.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Namespace all WebSocketPublisher log messages under
payload_builder::broadcast for easier filtering and log segregation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sieniven
Contributor Author

@claude review the PR again

@github-actions
Contributor

github-actions bot commented Mar 26, 2026

Claude finished @sieniven's task in 2m 35s


PR Review: feat(flashblocks): shift atomic gossip op, wspub on seq after p2p gossip

  • Gather context from previous reviews
  • Read current state of changed files
  • Analyze for any remaining or new issues
  • Provide review feedback

All three issues from the first review were confirmed fixed in the second review. This third pass verifies nothing has regressed and checks the latest commits (a338190, 4fcc2fc, 8236850, 8851252, 4d02bba).

✅ All previous issues remain fixed

WS ordering (broadcast/mod.rs:185–224): broadcast_message (p2p) runs first and fully awaits, then ws_pub.publish runs only after all peer writes have resolved. ✅

Error propagation: ws_pub.publish errors are handled with match ... Err(e) => warn!(...) — p2p node will not terminate on WS send failure. ✅

Metric coverage: p2p_enabled flag is gone from args/op.rs; broadcast node is always spawned; flashblock_byte_size_histogram is always recorded. ✅


New findings from latest commits

4d02bba — design comments: The added comment block at broadcast/mod.rs:170–184 is excellent. It clearly documents the "TCP kernel buffer" semantics and the intentional trade-off of not requiring ACKs from peers before WS publish. This makes the design intent explicit and reviewable.

8851252 — stream recovery on send failure: After broadcast_message returns failed_peers, the node immediately tries to re-open a new yamux stream to each failed peer (mod.rs:191–209). This is a nice improvement — it recovers from application-level stream closes without waiting for a full TCP reconnect cycle.

8236850 — peer retry interval reduced to 1s: Reasonable for a HA sequencer setup where failover latency matters.

4fcc2fc — real stream protocol in tests: Tests now use types::FLASHBLOCKS_STREAM_PROTOCOL instead of a fabricated protocol string, making them more faithful to production behavior.


One observation (non-blocking)

handler.rs:126 — p2p_tx.send(...).await creates backpressure

Some(payload) = built_fb_payload_rx.recv() => {
    let _ = p2p_tx.send(Message::from_flashblock_payload(payload)).await;
}

p2p_tx.send is an await-able call. If the broadcast node's outgoing_message_rx buffer fills up (capacity 100, mod.rs:448), this will block the handler's select loop, which would prevent built_payload_rx and p2p_rx from being polled. Eventually this causes built_fb_payload_tx.try_send() in builder.rs to return Err(Full) and flashblocks to be dropped. In practice the 100-message buffer should be far more than enough, so this is not likely to be a problem — just worth being aware of.
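The try_send failure mode is straightforward to demonstrate with a std bounded channel standing in for the tokio mpsc buffer (capacity shrunk to 2 for illustration; the channel names are hypothetical):

```rust
use std::sync::mpsc;

fn main() {
    // A capacity-2 bounded channel stands in for the 100-slot outgoing
    // message buffer. try_send fails fast when the buffer is full,
    // whereas a blocking send() would stall the caller's select loop.
    let (tx, _rx) = mpsc::sync_channel::<&str>(2);
    tx.try_send("fb1").unwrap();
    tx.try_send("fb2").unwrap();
    match tx.try_send("fb3") {
        Err(mpsc::TrySendError::Full(_)) => {
            eprintln!("warn: buffer full, flashblock dropped")
        }
        other => println!("unexpected: {other:?}"),
    }
}
```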


Data flow correctness (end-to-end)

| Path | Flow | WS publish | Correct? |
| --- | --- | --- | --- |
| Leader | builder → built_fb_payload_tx → handler → p2p_tx → Node::run() → p2p gossip → WS publish | after p2p | ✅ |
| Follower | remote peer → IncomingStreamsHandler → handler (p2p_rx) | cache + WS publish direct | ✅ |

No double-publish between paths (outgoing vs. incoming channels are distinct). ✅


LGTM. The PR achieves its stated goal cleanly. The design trade-off (no delivery ACK from peers, TCP kernel-buffer as the "happens-before" boundary) is clearly documented in comments and is a reasonable choice for low-latency flashblock streaming.

sieniven and others added 4 commits March 26, 2026 13:37
If `broadcast_message` returns `Err` (message serialization failure),
skip publishing the flashblock to WebSocket subscribers via `continue`.
Previously the ws publish ran unconditionally after the match block.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Upholds the reorg-safety invariant from PR #210: WebSocket publish to
RPC subscribers only happens if all follower sequencers received the
flashblock via P2P. If any peer's stream send failed, that peer is a
valid leader candidate without the flashblock — publishing to WS would
cause a reorg on sequencer switch.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the pre-loop `has_failed_peers` flag with iteration by
reference (`for &peer_id in &failed_peers`), keeping the vec accessible
after the loop so emptiness can be checked directly. PeerId is Copy so
the destructuring pattern is idiomatic.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sieniven
Contributor Author

@claude review again

@github-actions
Contributor

github-actions bot commented Mar 26, 2026

Claude finished @sieniven's task in 1m 48s


PR Re-Review (4th pass): feat(flashblocks): shift atomic gossip op, wspub on seq after p2p gossip

  • Gather context from previous reviews
  • Read current state of changed files (latest commits: 9d98db6, c498f71, b07040a, 7d7f01b, a338190)
  • Analyze for any new issues or regressions
  • Provide review feedback

All previously raised issues remain fixed. This pass reviews the 5 newest commits added since the last review.

✅ All previous fixes still hold

WS ordering (broadcast/mod.rs:185): broadcast_message (p2p) still runs and fully resolves before any ws_pub.publish call. ✅

Error propagation: ws_pub.publish errors still handled with match ... Err(e) => warn!(...). ✅

Metric coverage: broadcast node always spawned, flashblock_byte_size_histogram always recorded. ✅


New commits analysis

b07040a — skip ws publish if any peer failed (mod.rs:210–212)

if !failed_peers.is_empty() {
    continue;  // skip WS publish entirely
}

This tightens the atomicity guarantee: WS publish only fires when ALL peers successfully received the flashblock, at the cost of dropping at most one flashblock to WS subscribers per stream failure. The trade-off is intentional and consistent with the PR's zero-reorg goal. Worth noting: by the time the continue runs, the stream recovery loop has already attempted to re-open streams (mod.rs:191–209), so the next flashblock should be delivered normally. Dropping one flashblock per failure is acceptable.

7d7f01b — skip ws publish on serialization failure (mod.rs:214–217)

Err(e) => {
    warn!(...);
    continue;  // skip WS publish on serialize error
}

Consistent with the "p2p-first" design — serialization failures now also suppress WS publish. Correct.

c498f71 — iterate failed peers by ref (mod.rs:191)

for &peer_id in &failed_peers {

Clean refactor. Avoids consuming failed_peers early so the is_empty() check at line 210 still has access to the vec. Correct.

a338190 — log target rename
All tracing::warn! calls in wspub.rs now use target: "payload_builder::broadcast". Consistent with the rest of the broadcast module.

9d98db6 — change log level to warn
Changed a debug or info log to warn level. Appropriate for a failure path.


One observation (non-blocking)

b07040a interaction with stream recovery latency (mod.rs:191–209)

Stream recovery calls open_stream(...).await inside the outgoing_message_rx match arm, for each failed peer × each protocol. While Tokio can yield during these awaits, swarm events (swarm.select_next_some()) won't be processed until the entire arm returns. If multiple peers fail simultaneously and stream recovery is slow (e.g. peer overloaded), there's a window where swarm events accumulate. This was noted in the previous review as non-blocking — it remains non-blocking here, as the scenario requires simultaneous stream failures at a frequency faster than recovery.


Data flow (still correct end-to-end)

| Path | Flow | WS publish | Condition | Correct? |
| --- | --- | --- | --- | --- |
| Leader | builder → handler → p2p_tx → Node::run() → p2p gossip → WS publish | after p2p | only if all peers succeeded | ✅ |
| Follower | remote peer → IncomingStreamsHandler → handler (p2p_rx) | WS publish direct | always | ✅ |

No double-publish between paths. ✅


LGTM. The latest commits correctly tighten the atomicity guarantee (skip WS on any peer failure) and clean up minor refactors. The PR achieves its stated goal cleanly.

@louisliu2048 louisliu2048 merged commit e5f351b into main Mar 27, 2026
3 checks passed